The COVID-19 pandemic has had a huge effect on people’s lives, both socially and economically, and its severity has varied with the characteristics of the affected population. Media reports have observed that the pandemic has had a bigger impact on older populations, populations with higher percentages of Black and Hispanic people, and populations with lower income. We explore the relationship between the COVID-19 death rate in US counties and their socioeconomic characteristics. Potentially relevant variables are examined both graphically and numerically. We include state as one of the variables in all of our analyses to account for state-related variability, so that we can concentrate on the effect of the county-level socioeconomic variables. We find that the data confirm the observations made in the media reports about disadvantaged populations. We use statistical machine learning methods to predict the county death rate from the income, jobs, people, and county classification information of the county, and compare the methods by their prediction accuracy on test data. The methods we use are a linear model with stepwise variable selection, LASSO and relaxed LASSO, random forests, boosting, and deep learning networks. The results of these prediction methods can provide valuable guidance on issues such as resource allocation in the future.
The outbreak of the coronavirus disease 2019 (COVID-19) has been declared a global emergency by the World Health Organization (WHO). It has had over 30 million reported cases worldwide, with more than one million deaths. There have been over seven million cases and 200,000 deaths in the US alone. In response, governments have implemented travel restrictions, business and school closures, and other social distancing policies. The pandemic has affected every aspect of human society.
The impact of COVID-19 on a population can vary with the characteristics of that population. There are reports of different infection rates and fatality rates among different age groups, racial groups, and income groups. One of the most important and direct measurements of the impact is the death rate: the number of people who die of the disease relative to the total population. In this project, we study the relationship between the COVID-19 death rate in a county and its socioeconomic characteristics. We obtained county-level data on income, jobs, people, and county classifications, as well as the cumulative infection and fatality numbers for each county in the US. We concentrate on the fatality data rather than the infection data because fatality is better defined and does not depend on factors such as whether tests are widely available in the area.
The socioeconomic county-level data includes a large number of variables. After removing redundant and/or irrelevant variables, we included 41 potential explanatory variables in our analysis. We explore the variables with statistical plots and maps, and study their relation to the county death rate. It is clear that other variables, such as the phase of the pandemic the state is going through, and the lockdown/social distancing policies of the state, can also strongly relate to the county death rate. In order to account for differences between states and concentrate on the effect of county level socioeconomic variables, it is important that we include state as an explanatory variable in our analysis, as it helps to control for state related variability such as timeline and state policy.
We explore the relationship of the explanatory variables with the county death rate graphically and numerically. We find evidence in the data that confirms the observations made in media reports that the COVID-19 death rate is higher in older populations, populations with higher percentages of Black and Hispanic people, and populations with lower income. We use machine learning methods to build models that predict county death rates from county socioeconomic characteristics, and compare the models by their prediction accuracy on test data. The methods include a linear model with stepwise variable selection, LASSO and relaxed LASSO, random forests, boosting, and deep learning neural networks. While this analysis only provides association and prediction, not cause and effect, the results can provide valuable guidance on issues such as resource allocation in the case of potential future waves of similar diseases.
The data comes from two different sources:
A detailed list of the variables is in the appendix.
We load the county-level income, jobs, people, and county classification data. For consistency, we rename the FIPStxt column of county.class to FIPS. The county FIPS code uniquely identifies each county in the US.
We merge all of the county level socioeconomic datasets into one dataset according to FIPS. We call the combined dataset countydata.
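The rename and merge steps can be sketched in base R as follows. This is a minimal illustration on toy tables, not the code used for the report: the object names county.income and county.jobs and all values are assumptions made for the example (only county.class, county.people, FIPStxt, and countydata appear in the text).

```r
# Toy stand-ins for the four county-level tables (all values are made up).
county.income <- data.frame(FIPS = c(1001, 1003, 1005), PerCapitaInc = c(27000, 31000, 18000))
county.jobs   <- data.frame(FIPS = c(1001, 1003, 1005), UnempRate2019 = c(2.7, 2.8, 3.8))
county.people <- data.frame(FIPS = c(1001, 1003, 1005), TotalPopEst2019 = c(50000, 200000, 25000))
county.class  <- data.frame(FIPStxt = c(1001, 1003, 1005), Metro2013 = c(1, 1, 0))

# Rename FIPStxt to FIPS so that every table shares the same key column.
names(county.class)[names(county.class) == "FIPStxt"] <- "FIPS"

# Merge the tables one after another on the shared FIPS key (inner join).
countydata <- Reduce(
  function(x, y) merge(x, y, by = "FIPS"),
  list(county.income, county.jobs, county.people, county.class)
)
```

Because `merge` defaults to an inner join, a county missing from any one table would be dropped here; passing `all = TRUE` would keep it with missing values instead.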
We restrict the analysis to counties in the continental US, removing Hawaii, Alaska, and Puerto Rico both because they lie outside the continental US and because they have many missing values. We also remove Bedford, VA (FIPS 51515), because its FIPS code changed; the new FIPS is 51019.
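A minimal sketch of this filtering step, assuming the county table carries a two-letter state column (the column name State and the toy rows are assumptions for illustration):

```r
# Toy county table; the State column and its codes are assumed for this example.
countydata <- data.frame(
  FIPS  = c(1001, 2013, 15001, 51019, 51515, 72001),
  State = c("AL", "AK", "HI", "VA", "VA", "PR"),
  stringsAsFactors = FALSE
)

excluded_states <- c("AK", "HI", "PR")  # outside the continental US
retired_fips    <- 51515                # Bedford, VA; the new FIPS is 51019

# Keep rows that are neither in an excluded state nor carry the retired FIPS code.
keep <- !(countydata$State %in% excluded_states) &
        !(countydata$FIPS %in% retired_fips)
countydata <- countydata[keep, ]
```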
We now load the county level confirmed COVID-19 infection and fatality numbers data. These are daily cumulative numbers. Counties with no confirmed COVID-19 infections were not included in the data. The following plot gives the total number of counties with confirmed cases by day.
We take the cumulative data as of 8/19/2020 (the date we downloaded the dataset) for further analysis. Since each state is going through a different phase of the pandemic and has its own lockdown and reopening timeline and policies, it is important to include state as a control variable in our analysis. This puts counties in different states on an equal footing and lets us concentrate on the effect of county-level socioeconomic variables.
There are 29 records in the COVID-19 data with unknown county; we discard them. As with the county data, we keep only counties in the continental US, removing Hawaii, Alaska, and Puerto Rico.
There are also three special records in the COVID-19 data: New York City, NY; Joplin, MO; and Kansas City, MO.
New York City has five boroughs that each count as a county in the countydata, yet the COVID-19 data reports only New York City as a whole. To make the COVID-19 data and the countydata consistent in their treatment of New York City, we remove the New York City entry from the COVID-19 data and add data for the five boroughs. We manually look up the number of deaths and infections for each of the five boroughs around 08/19/2020 and enter them into our data.
Joplin and Kansas City also consist of multiple counties (Joplin has Jasper and Newton; Kansas City has Jackson, Clay, Platte, and Cass), but the COVID-19 data already contains data on the counties. So we just remove Joplin and Kansas City from the COVID-19 data.
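The handling of the three city-level records might look like the sketch below. The column names, the toy counts, and the missing FIPS codes on the city rows are assumptions; the borough counts are deliberately left as NA placeholders because the report enters them by hand and they are not reproduced here. The borough FIPS codes are the standard ones for Bronx, Kings, New York, Queens, and Richmond counties.

```r
# Toy COVID-19 table; column names and counts are made up for illustration.
covid <- data.frame(
  county = c("New York City", "Joplin", "Kansas City", "Jasper", "Albany"),
  state  = c("New York", "Missouri", "Missouri", "Missouri", "New York"),
  fips   = c(NA, NA, NA, 29097, 36001),
  cases  = c(100, 10, 20, 30, 40),
  deaths = c(10, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Drop the three city-level records.
city_records <- c("New York City", "Joplin", "Kansas City")
covid <- covid[!(covid$county %in% city_records), ]

# Add one row per New York City borough, keyed by county FIPS.
# Counts are NA placeholders to be filled in manually from a borough-level source.
boroughs <- data.frame(
  county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  state  = "New York",
  fips   = c(36005, 36047, 36061, 36081, 36085),
  cases  = NA_real_,
  deaths = NA_real_,
  stringsAsFactors = FALSE
)
covid <- rbind(covid, boroughs)
```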
We merge the countydata and the COVID-19 data by FIPS. The combined data has 3109 continental US counties and 211 variables.
Now we go through the county socioeconomic variables. We remove the variables that are clearly not relevant. Among variables that are highly correlated and close in meaning, we keep only one. For example, we remove TotalHH (the total number of households in a county) because its correlation with TotalPopEst2019 (the total population) is 0.996. We also remove MedHHInc in favor of PerCapitaInc, and HH65PlusAlonePct in favor of Age65AndOlderPct2010.
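One way to screen for such near-duplicate variables is to list the pairs whose absolute correlation exceeds a cutoff. The sketch below uses simulated data and an illustrative cutoff of 0.95; both are assumptions, and the report's actual choices were also guided by the meaning of the variables.

```r
# Toy data: TotalHH is constructed to track TotalPopEst2019 almost exactly.
set.seed(1)
pop <- round(runif(50, 1000, 100000))
toy <- data.frame(
  TotalPopEst2019 = pop,
  TotalHH         = round(pop / 2.5 + rnorm(50, sd = 200)),
  PerCapitaInc    = round(rnorm(50, mean = 27000, sd = 5000))
)

# Pairwise correlations, ignoring missing values pair by pair.
cors <- cor(toy, use = "pairwise.complete.obs")

# List each pair once (upper triangle) if its absolute correlation exceeds the cutoff.
cutoff <- 0.95  # illustrative threshold, not taken from the report
high <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
```

Each row of `high_pairs` is a candidate pair from which one variable could be dropped.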
We keep only the latest version of each variable. For example, the unemployment rate appears as UnempRate2010, UnempRate2011, …; we remove the earlier years and keep only UnempRate2019.
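For variables whose names end in a four-digit year, keeping the latest version can be sketched as below. This is a simplified illustration: the list of names is a toy example, and variables with other suffix schemes (such as NetMigrationRate1019) would need separate handling.

```r
vars <- c("UnempRate2010", "UnempRate2011", "UnempRate2019",
          "TotalPopEst2018", "TotalPopEst2019", "PerCapitaInc")

# Split each name into a stem and a trailing four-digit year, if present.
has_year <- grepl("[0-9]{4}$", vars)
stem <- sub("[0-9]{4}$", "", vars)
year <- rep(NA_integer_, length(vars))
year[has_year] <- as.integer(sub("^.*([0-9]{4})$", "\\1", vars[has_year]))

# Most recent year available for each stem.
latest <- sapply(split(year[has_year], stem[has_year]), max)

# Keep names without a year, plus the most recent year within each stem.
keep <- vars[!has_year | year == latest[stem]]
```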
Some groups of variables always add up to 100%, so one variable in each group is redundant and we remove it. For example, Ed1LessThanHSPct, Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct, and Ed5CollegePlusPct are percentages that add up to 100, so we remove Ed1LessThanHSPct (percentage of the population with an education level below high school) from our variable set.
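The reason for dropping one member of such a group is that, together with an intercept, the full set is perfectly collinear. The sketch below checks this on simulated shares (the values are made up) by comparing the rank of the design matrix with its number of columns.

```r
set.seed(2)
n <- 40

# Toy education shares that sum to 100 in every row (values are simulated).
raw <- matrix(runif(n * 5), ncol = 5)
shares <- as.data.frame(100 * raw / rowSums(raw))
names(shares) <- c("Ed1LessThanHSPct", "Ed2HSDiplomaOnlyPct", "Ed3SomeCollegePct",
                   "Ed4AssocDegreePct", "Ed5CollegePlusPct")

# Intercept plus all five shares: the columns are linearly dependent.
X_all <- cbind(1, as.matrix(shares))
rank_all <- qr(X_all)$rank    # smaller than ncol(X_all)

# Intercept plus four shares (Ed1LessThanHSPct dropped): full column rank.
X_drop <- cbind(1, as.matrix(shares[, -1]))
rank_drop <- qr(X_drop)$rank  # equal to ncol(X_drop)
```

The dropped category becomes the implicit baseline: the coefficients on the remaining shares are interpreted relative to it.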
Many of the variables from the county classification data are categorical, but the other county datasets already have continuous variables for them. For instance, county.people has different education levels, but county.class has Low_Education_2015_update, which classifies a county as low-education. So we remove Low_Education_2015_update since it is less informative. Similarly, UrbanInfluenceCode2013 provides more refined information than each of Noncore2013, Micropolitan2013, Nonmetro2013, Metro2013, Metro_Adjacent2013. So we remove the latter variables.
Since TotalPop25Plus is the total number of people aged 25 and over, we create a more relevant variable: the proportion of the population aged 25 and over. Because TotalPop25Plus is an average over 5 years, we use the average of TotalPopEst from 2014 to 2018 as the denominator when calculating this proportion.
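A minimal sketch of this calculation, assuming the yearly population estimates are stored in columns named TotalPopEst2014 through TotalPopEst2018 (the column naming follows TotalPopEst2019 and is an assumption; the numbers are made up):

```r
toy <- data.frame(
  TotalPop25Plus  = c(6800, 14200),
  TotalPopEst2014 = c(10000, 20000),
  TotalPopEst2015 = c(10100, 20200),
  TotalPopEst2016 = c(10000, 20400),
  TotalPopEst2017 = c(9900, 20600),
  TotalPopEst2018 = c(10000, 20800)
)

# Average the five yearly estimates to match the 5-year basis of TotalPop25Plus.
pop_cols <- paste0("TotalPopEst", 2014:2018)
toy$Over25Pct2018 <- toy$TotalPop25Plus / rowMeans(toy[, pop_cols])
```

The result is a proportion between 0 and 1, which matches the range of Over25Pct2018 in the summary output.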
We fill in the missing values for infections and deaths with zero. A county that is missing from the COVID-19 data simply has not yet had any confirmed cases (as of 8/19/2020).
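Combined with the merge by FIPS, the fill-in step can be sketched as a left join followed by replacing missing counts with zero. The tables and values below are toy examples, and the key column name fips on the COVID-19 side is an assumption.

```r
countydata <- data.frame(FIPS = c(1001, 1003, 1005),
                         TotalPopEst2019 = c(50000, 200000, 25000))
covid <- data.frame(fips = c(1001, 1003), cases = c(1200, 4100), deaths = c(22, 30))

# Left join: keep every county, including those absent from the COVID-19 table.
combined <- merge(countydata, covid, by.x = "FIPS", by.y = "fips", all.x = TRUE)

# Counties absent from the COVID-19 table had no confirmed cases, so set counts to 0.
for (v in c("cases", "deaths")) {
  combined[[v]][is.na(combined[[v]])] <- 0
}
```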
Here is a summary of all the remaining variables. There are still a very small number of missing values; we will remove the counties with missing values before we move on to data modeling, but these counties are kept for now for the sake of the graphs of the variables that do not involve missing values.
## County state fips cases
## Length:3109 Length:3109 Min. : 1001 Min. : 0
## Class :character Class :character 1st Qu.:19043 1st Qu.: 73
## Mode :character Mode :character Median :29211 Median : 260
## Mean :30666 Mean : 1757
## 3rd Qu.:46007 3rd Qu.: 872
## Max. :56045 Max. :225827
##
## deaths Deep_Pov_All PovertyAllAgesPct PerCapitaInc
## Min. : 0.00 Min. : 0.000 Min. : 2.60 Min. :10148
## 1st Qu.: 1.00 1st Qu.: 4.469 1st Qu.:10.90 1st Qu.:22750
## Median : 4.00 Median : 6.109 Median :14.20 Median :26216
## Mean : 53.85 Mean : 6.685 Mean :15.18 Mean :26980
## 3rd Qu.: 20.00 3rd Qu.: 8.038 3rd Qu.:18.30 3rd Qu.:29986
## Max. :5981.00 Max. :33.183 Max. :54.00 Max. :72832
## NA's :1 NA's :1
## UnempRate2019 PctEmpFIRE PctEmpConstruction PctEmpTrans
## Min. : 0.700 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 3.347 1st Qu.: 5.786 1st Qu.: 4.167
## Median : 3.700 Median : 4.283 Median : 7.051 Median : 5.238
## Mean : 3.963 Mean : 4.569 Mean : 7.339 Mean : 5.512
## 3rd Qu.: 4.600 3rd Qu.: 5.436 3rd Qu.: 8.581 3rd Qu.: 6.498
## Max. :18.300 Max. :20.603 Max. :25.532 Max. :22.487
## NA's :1 NA's :1 NA's :1
## PctEmpMining PctEmpTrade PctEmpInformation PctEmpAgriculture
## Min. : 0.00000 Min. : 0.838 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.08058 1st Qu.:12.248 1st Qu.: 0.8615 1st Qu.: 1.154
## Median : 0.30194 Median :13.840 Median : 1.2860 Median : 2.866
## Mean : 1.58556 Mean :13.695 Mean : 1.3678 Mean : 5.089
## 3rd Qu.: 1.27458 3rd Qu.:15.277 3rd Qu.: 1.7308 3rd Qu.: 6.294
## Max. :44.03561 Max. :38.889 Max. :12.3288 Max. :59.649
## NA's :1 NA's :1 NA's :1 NA's :1
## PctEmpManufacturing PctEmpServices PctEmpGovt PopDensity2010
## Min. : 0.000 Min. : 8.333 Min. : 0.000 Min. : 0.12
## 1st Qu.: 6.858 1st Qu.:38.291 1st Qu.: 3.513 1st Qu.: 17.63
## Median :11.434 Median :42.768 Median : 4.733 Median : 45.64
## Mean :12.343 Mean :42.994 Mean : 5.505 Mean : 264.36
## 3rd Qu.:16.724 3rd Qu.:47.539 3rd Qu.: 6.489 3rd Qu.: 115.04
## Max. :48.024 Max. :81.589 Max. :33.500 Max. :69468.42
## NA's :1 NA's :1 NA's :1
## OwnHomePct Age65AndOlderPct2010 Over25Pct2018 Under18Pct2010
## Min. :19.61 Min. : 3.73 Min. :0.3847 Min. : 9.11
## 1st Qu.:67.71 1st Qu.:13.19 1st Qu.:0.6662 1st Qu.:21.42
## Median :72.67 Median :15.60 Median :0.6911 Median :23.31
## Mean :71.53 Mean :15.95 Mean :0.6887 Mean :23.41
## 3rd Qu.:77.06 3rd Qu.:18.25 3rd Qu.:0.7154 3rd Qu.:25.09
## Max. :92.40 Max. :43.38 Max. :0.9600 Max. :40.13
##
## Ed2HSDiplomaOnlyPct Ed3SomeCollegePct Ed4AssocDegreePct Ed5CollegePlusPct
## Min. : 5.47 Min. : 4.116 Min. : 1.116 Min. : 0.00
## 1st Qu.:29.81 1st Qu.:19.197 1st Qu.: 7.148 1st Qu.:14.99
## Median :34.57 Median :21.664 Median : 8.663 Median :19.22
## Mean :34.28 Mean :21.780 Mean : 8.930 Mean :21.56
## 3rd Qu.:39.28 3rd Qu.:24.108 3rd Qu.:10.535 3rd Qu.:25.52
## Max. :55.62 Max. :38.667 Max. :21.397 Max. :78.53
##
## ForeignBornPct Net_International_Migration_Rate_2010_2019
## Min. : 0.000 Min. :-1.2450
## 1st Qu.: 1.346 1st Qu.: 0.0890
## Median : 2.711 Median : 0.3820
## Mean : 4.681 Mean : 0.8748
## 3rd Qu.: 5.671 3rd Qu.: 1.0310
## Max. :53.254 Max. :20.4030
##
## NetMigrationRate1019 NaturalChangeRate1019 TotalPopEst2019
## Min. :-32.17900 Min. :-11.0250 Min. : 169
## 1st Qu.: -4.08100 1st Qu.: -1.4230 1st Qu.: 11131
## Median : -1.18400 Median : 0.4920 Median : 26118
## Mean : 0.01595 Mean : 0.9355 Mean : 105113
## 3rd Qu.: 3.15200 3rd Qu.: 2.8640 3rd Qu.: 68238
## Max. :115.58000 Max. : 23.0850 Max. :10039107
##
## WhiteNonHispanicPct2010 NativeAmericanNonHispanicPct2010
## Min. : 2.80 Min. : 0.00
## 1st Qu.:67.29 1st Qu.: 0.19
## Median :85.94 Median : 0.30
## Mean :78.62 Mean : 1.59
## 3rd Qu.:94.27 3rd Qu.: 0.61
## Max. :99.16 Max. :94.10
##
## BlackNonHispanicPct2010 AsianNonHispanicPct2010 HispanicPct2010
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.410 1st Qu.: 0.270 1st Qu.: 1.590
## Median : 1.940 Median : 0.460 Median : 3.290
## Mean : 8.842 Mean : 1.063 Mean : 8.329
## 3rd Qu.:10.020 3rd Qu.: 0.970 3rd Qu.: 8.290
## Max. :85.440 Max. :33.000 Max. :95.740
##
## Type_2015_Update RuralUrbanContinuumCode2013 UrbanInfluenceCode2013
## Min. :0.000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.: 2.000
## Median :1.000 Median :6.000 Median : 5.000
## Mean :1.792 Mean :4.987 Mean : 5.225
## 3rd Qu.:3.000 3rd Qu.:7.000 3rd Qu.: 8.000
## Max. :5.000 Max. :9.000 Max. :12.000
## NA's :1 NA's :1 NA's :1
## Perpov_1980_0711 HiCreativeClass2000 HiAmenity
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1129 Mean :0.2485 Mean :0.2498
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :1 NA's :2 NA's :3
## Retirement_Destination_2015_Update
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1416
## 3rd Qu.:0.0000
## Max. :1.0000
## NA's :1
We map a few key variables of interest. This gives an intuitive picture of how social and economic characteristics are distributed geographically, as well as how infection and death rates vary across the counties of the continental US. We include both state-level and county-level maps.
In the first map above, we show the total number of deaths by state. In the second, we show the death rate (number of deaths per 100,000 people) by state. There is a clear difference between the two: in the first, New York appears to have the most deaths, but in the second, we see that New Jersey has proportionally more. The death rate is the more appropriate variable to look at here because it takes the total population of a state into account.
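The difference between the two measures is easy to see in a small worked example. The states and numbers below are hypothetical and chosen only to show how the ranking can flip once population is taken into account.

```r
# Hypothetical states: A has more deaths, B has the higher death rate.
states <- data.frame(
  state  = c("A", "B"),
  deaths = c(3000, 1600),
  pop    = c(20000000, 8000000),
  stringsAsFactors = FALSE
)

# Death rate per 100,000 people.
states$death_rate <- states$deaths / states$pop * 1e5

# Sort by death rate, highest first: B (20 per 100,000) ranks above A (15 per 100,000).
states[order(-states$death_rate), ]
```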
Here we have an interactive map of the racial composition of state populations. The reader can see the racial composition of each continental US state by hovering the mouse over it.
Next is an interactive map on the education levels for US states.
Below is a map for the per capita income of the US states.
And now we map the infection and death rates by state.
These next two maps are more detailed. They give information on number and rate of deaths at the county level, allowing us to see variation in the death rate of counties within the same state.
The following is a map on infection rate (number of infected per 100,000 people).
The graph below is an interactive map that gives the FIPS and death rate of each county.
This is an interactive map for population density at the county level.